
Record: 0.2292 BPB — Dirichlet-Multinomial Smoothing + Distributed Prefill + 15-Gram + EBLS#796

Open
Robby955 wants to merge 4 commits into openai:main from Robby955:record/prefill-7gram-ebls-0.6567

Conversation


@Robby955 Robby955 commented Mar 26, 2026

Record: Empirical Bayes N-gram Mixing -- val_bpb=0.2292

What this does

Instead of hand-tuning alpha multipliers for each n-gram order (my previous submission at 0.2880), I replaced the mixing strategy with Bayesian posterior inference.

The formula:

`p(token) = (ngram_count + c * neural_prob) / (total + c)`

This is the Dirichlet-Multinomial posterior predictive. The neural model is the prior, n-gram counts are the likelihood, concentration c controls the tradeoff. Applied recursively from bigram up to 15-gram, where each order's smoothed estimate becomes the next order's prior.

A single global concentration (c=5.0) handles the sparse-count problem that previously required hand-tuned per-order multipliers. The improvement is 0.059 BPB, which I didn't expect from replacing 14 tuned parameters with 1.
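
The recursion above can be sketched in a few lines of numpy; `posterior_predictive` and `counts_by_order` are illustrative names, not the PR's actual code:

```python
import numpy as np

def posterior_predictive(neural_probs, counts_by_order, c=5.0):
    """Recursive Dirichlet-Multinomial mixing: the neural distribution is the
    base prior, and each order's smoothed estimate becomes the next order's
    prior. counts_by_order holds one count vector per n-gram order (low to
    high) for the current context."""
    prior = np.asarray(neural_probs, dtype=float)
    for counts in counts_by_order:          # bigram -> trigram -> ... -> 15-gram
        counts = np.asarray(counts, dtype=float)
        total = counts.sum()
        # p(token) = (ngram_count + c * neural_prob) / (total + c),
        # applied elementwise over the vocabulary
        prior = (counts + c * prior) / (total + c)
    return prior
```

When a high-order context has no counts (`total == 0`), the formula reduces exactly to the lower-order prior, which is the backoff behavior the per-order multipliers were hand-tuning.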

Results

| Seed | BPB | Artifact (bytes) |
|------|-----|------------------|
| 1337 | 0.22922259 | 14,845,997 |
| 2024 | 0.22923179 | 14,860,181 |
| 2025 | 0.22922912 | 14,846,933 |
| Mean | 0.22923 | -- |
| Std | 0.000005 | -- |

Ablation chain

| Config | BPB | Delta |
|--------|-----|-------|
| Neural model only (no cache) | 1.1745 | -- |
| + 7-gram backoff + prefill | 0.6565 | -0.518 |
| + extend to 15-gram | 0.6189 | -0.038 |
| + order-adaptive gating | 0.4374 | -0.182 |
| + complementary training (alpha=0.50) | 0.3707 | -0.067 |
| + per-order multipliers | 0.2880 | -0.083 |
| + Dirichlet smoothing (c=5.0) | 0.2292 | -0.059 |

What's novel

Using a neural LM as the base measure in hierarchical Bayesian n-gram smoothing. Traditional Bayesian LMs (MacKay & Peto 1995, Teh 2006) use uniform or unigram priors. This is the Dirichlet special case (discount=0) of the Pitman-Yor family, a sibling to Kneser-Ney, not a generalization of it.

What's borrowed

N-gram cache approach from the community (especially @deanbrr, @lukacf, @Asukabot0, @newjordan). Complementary training from @pentxayc. Per-order multiplier concept from @AayushBaniya2006 (now replaced by Dirichlet). The Bayesian smoothing formula itself is textbook.

Compliance

| Constraint | Limit | Actual | Status |
|------------|-------|--------|--------|
| Train time | 600s | 560s | Pass |
| Eval time | 600s | 366s (max across seeds) | Pass |
| Artifact size | 16,000,000 bytes | 14,860,181 (max) | Pass |
| Backward-looking cache | required | yes | Pass |
| Single-pass eval | required | yes | Pass |

Technical details

11L transformer (3 shared x 3 loops + 2 unique, EBLS), 512d, 8 heads / 4 KV heads (GQA), complementary n-gram training (alpha=0.5), 15-order recursive Bayesian backoff with concentration=5.0, int6 GPTQ + LZMA compression. ~14.9 MB artifact.

Feedback welcome.

3-seed validated: s1337=0.6565, s2024=0.6570, s2025=0.6565 (mean 0.6567, std 0.0003)
8xH100 SXM, 560s training + ~300s eval, all artifacts under 16MB.

Key innovation: distributed cache pre-fill using pure numpy.
Each GPU rank pre-populates n-gram hash tables with ALL preceding
token positions before scoring, producing results mathematically
identical to single-GPU sequential evaluation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
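
A minimal sketch of that pre-fill step; the function name and the dict-of-dicts layout here are assumptions for illustration, not the PR's actual data structure:

```python
import numpy as np

def prefill_cache(tokens, start, max_order, cache=None):
    """Replay all positions before `start` into the n-gram hash table, so a
    rank scoring tokens[start:] sees the same backward-looking counts as a
    sequential single-GPU pass. Keys are (order, context); values map the
    next token to its count."""
    cache = {} if cache is None else cache
    tokens = np.asarray(tokens)
    for n in range(2, max_order + 1):
        for i in range(n - 1, start):
            ctx = tuple(tokens[i - n + 1:i].tolist())  # (n-1)-token context
            nxt = int(tokens[i])                       # token that followed it
            bucket = cache.setdefault((n, ctx), {})
            bucket[nxt] = bucket.get(nxt, 0) + 1
    return cache
```

Because every rank replays the identical prefix, the per-rank caches agree bit-for-bit with the sequential evaluation the commit message describes.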
…ptive gating

3-seed validated (seeds 1337, 2024, 2025, std 0.0003).
Down from 0.6567 via two innovations: distributed cache pre-fill (-0.31 BPB)
and order-adaptive entropy gating (-0.18 BPB).
@Robby955 Robby955 changed the title Record: 0.6567 BPB — Prefill Cache + 7-Gram Entropy-Adaptive + EBLS Record: 0.4374 BPB — Distributed Prefill + Order-Adaptive 15-Gram + EBLS Mar 26, 2026
@hypery11

nice 🔥🔥🔥🔥

Add complementary training (from @pentxayc openai#803) and per-order
multipliers (from @AayushBaniya2006 openai#809) on top of distributed
prefill + 15-gram + order-adaptive gating.

New 3-seed results: 0.28798 / 0.28804 / 0.28810
All seeds under 16MB, training under 560s, eval under 330s.

Updated README with legality hedge, full ablation, credits.
@Robby955 Robby955 changed the title Record: 0.4374 BPB — Distributed Prefill + Order-Adaptive 15-Gram + EBLS Record: 0.2880 BPB — Complementary Training + Per-Order Multipliers + Distributed Prefill + 15-Gram + EBLS Mar 26, 2026
RoyiRa added a commit to RoyiRa/parameter-golf that referenced this pull request Mar 26, 2026
CRITICAL FIX: Previously each of 8 GPU ranks only updated its n-gram
cache with its own 1/8 of scored windows. Now ALL ranks update with
the FULL chunk (same as mixer already does).

PR openai#796 showed this costs ~0.31 BPB: "Without pre-fill, ranks 1-7
start with empty n-gram caches. This costs ~0.31 BPB."

Expected: massive improvement from 8x more n-gram data per rank.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RoyiRa added a commit to RoyiRa/parameter-golf that referenced this pull request Mar 26, 2026
Full-chunk n-gram cache sharing: 0.6913 -> 0.5865 (-0.105 BPB)
This confirms PR openai#796's finding that rank-local caches lose ~0.1+ BPB.

WARNING: artifact=16.25MB (over 16MB limit for this seed).
Need to increase pruning from 3% to 4%, or reduce bigram_vocab_size,
to ensure all seeds fit.

Eval time: 492s (within budget).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RoyiRa added a commit to RoyiRa/parameter-golf that referenced this pull request Mar 26, 2026
…ipliers

Novel improvement over uniform entropy threshold:
- Per-order entropy center: order 2 → 5.0 (trust only when confused),
  order max → 2.0 (trust even when model is OK)
- Per-order alpha multiplier: order 2 → 0.3× (suppress noise),
  order max → 2.0× (boost precision)
- Linear interpolation between orders for smooth transition

Inspired by PR openai#796's ablation showing -0.182 BPB from order-adaptive
gating alone. Our implementation is continuous (sigmoid per order)
rather than discrete thresholds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
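
The continuous gating this commit describes could look roughly like the following; the interpolation endpoints mirror the values listed above, but the function name and exact shape are assumptions:

```python
import numpy as np

def order_gates(entropy, orders, max_order,
                center_lo=5.0, center_hi=2.0,  # entropy center: order 2 -> 5.0, max -> 2.0
                mult_lo=0.3, mult_hi=2.0):     # alpha multiplier: order 2 -> 0.3x, max -> 2.0x
    """Per-order sigmoid gate: linearly interpolate the entropy center and
    alpha multiplier between order 2 and the maximum order, then weight each
    order by a sigmoid of (neural entropy - center)."""
    t = (np.asarray(orders, dtype=float) - 2.0) / (max_order - 2.0)  # 0 at order 2, 1 at max
    centers = center_lo + t * (center_hi - center_lo)
    mults = mult_lo + t * (mult_hi - mult_lo)
    return mults / (1.0 + np.exp(-(entropy - centers)))  # sigmoid gate per order
```

Low orders are trusted only when the neural model is confused (high entropy) and damped; high orders are trusted even at moderate entropy and boosted, with smooth transitions in between.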
… validated)

Replace per-order multipliers with recursive Dirichlet posterior predictive.
Neural model as informative prior, single concentration c=5.0.
3-seed mean: 0.22923 BPB (std 0.000005).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Robby955 Robby955 changed the title Record: 0.2880 BPB — Complementary Training + Per-Order Multipliers + Distributed Prefill + 15-Gram + EBLS Record: 0.2292 BPB — Dirichlet-Multinomial Smoothing + Distributed Prefill + 15-Gram + EBLS Mar 26, 2026
@Robby955
Author

Updated submission: 0.6567 → 0.2880 → 0.2292 BPB (3-seed mean, std 0.000005).

Replaced per-order multipliers with Dirichlet-Multinomial posterior smoothing (single concentration c=5.0). All logs, code, and submission.json updated in latest commit.

@MatoTeziTanka

This is one of the cleanest submissions in the competition. Replacing 14 hand-tuned per-order alpha parameters with a single Dirichlet concentration (c=5.0) is elegant — the recursive posterior predictive naturally handles sparsity at high orders without any manual intervention. The math does what entropy thresholds and sigmoid gating are trying to approximate.

The 3-seed std of 0.000005 is also remarkable — tightest we've seen across all submissions.

Nice work.

